retry transient tcp error #4174

Merged
jgao54 merged 3 commits into main from normalize-err on Apr 14, 2026

@jgao54 jgao54 commented Apr 14, 2026

This PR fixes two things:

  • Handle error code: 1001, message: std::__1::ios_base::failure: ios_base::clear: unspecified iostream_category error. This is a transient networking error that can occur in S3 during reads and usually recovers after a single retry. We also want to categorize it as retryable so that a single error does not restart the snapshot from scratch.
  • When checking for the VM substring, use error.Message() so that the substring search also covers all the wrapped error messages. With the current behavior, these errors are classified as Normalization Error, which also notifies, but ViewError is more direct here.
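
The first point can be illustrated with a minimal, self-contained sketch. It is not the code from this PR: `codedError`, `retryableSubstrings`, `isRetryable`, `wrapErr`, and `retryOnce` are hypothetical names chosen for illustration, and only the error code 1001 and the iostream message text come from the description above. The idea is that an error counts as retryable only when both its code and a message substring match, and that the check still works when the error has been wrapped.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// codedError is a hypothetical stand-in for a driver exception that carries
// a numeric error code and a message.
type codedError struct {
	Code    int32
	Message string
}

func (e *codedError) Error() string {
	return fmt.Sprintf("code: %d, message: %s", e.Code, e.Message)
}

// retryableSubstrings maps an error code to the message substrings that make
// an error with that code retryable. The name and shape are assumptions.
var retryableSubstrings = map[int32][]string{
	1001: {"unspecified iostream_category error"},
}

// isRetryable reports whether err is, or wraps, a codedError whose code is
// listed in retryableSubstrings and whose message contains a listed substring.
func isRetryable(err error) bool {
	var ce *codedError
	if !errors.As(err, &ce) {
		return false
	}
	for _, sub := range retryableSubstrings[ce.Code] {
		if strings.Contains(ce.Message, sub) {
			return true
		}
	}
	return false
}

// wrapErr adds context to err while keeping it reachable through errors.As.
func wrapErr(context string, err error) error {
	return fmt.Errorf("%s: %w", context, err)
}

// retryOnce runs op and, if it fails with a retryable error, runs it exactly
// one more time. Any other error is returned without a retry.
func retryOnce(op func() error) error {
	err := op()
	if err != nil && isRetryable(err) {
		return op()
	}
	return err
}

func main() {
	transient := wrapErr("read from s3", &codedError{
		Code:    1001,
		Message: "std::__1::ios_base::failure: ios_base::clear: unspecified iostream_category error",
	})
	fmt.Println(isRetryable(transient))                                               // true
	fmt.Println(isRetryable(&codedError{Code: 1001, Message: "some other failure"})) // false
}
```

Matching on both the code and a substring keeps the retry narrow: other errors that share code 1001 but carry a different message are still treated as non-retryable in this sketch.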

@jgao54 jgao54 requested review from ilidemi and masterashu April 14, 2026 08:51

codecov bot commented Apr 14, 2026

❌ 2 Tests Failed:

Tests completed | Failed | Passed | Skipped
2185            | 2      | 2183   | 196
View the top 3 failed test(s) by shortest run time
github.com/PeerDB-io/peerdb/flow/e2e::TestApiMy
Stack Traces | 0s run time
=== RUN   TestApiMy
=== PAUSE TestApiMy
=== CONT  TestApiMy
--- FAIL: TestApiMy (0.00s)
github.com/PeerDB-io/peerdb/flow/e2e::TestGenericBQ
Stack Traces | 0s run time
=== RUN   TestGenericBQ
=== PAUSE TestGenericBQ
=== CONT  TestGenericBQ
--- FAIL: TestGenericBQ (0.00s)
github.com/PeerDB-io/peerdb/flow/e2e::TestApiMongo
Stack Traces | 0.01s run time
=== RUN   TestApiMongo
=== PAUSE TestApiMongo
=== CONT  TestApiMongo
--- FAIL: TestApiMongo (0.01s)
2026/04/14 20:36:38 INFO Executing and processing query x-peerdb-additional-metadata={Operation:FLOW_OPERATION_UNKNOWN} partitionId=testpart query="SELECT id,val FROM e2e_test_api_qs8euhl4.\"t1\" ORDER BY id"
2026/04/14 20:36:38 INFO Executing and processing query stream x-peerdb-additional-metadata={Operation:FLOW_OPERATION_UNKNOWN} partitionId=testpart query="SELECT id,val FROM e2e_test_api_qs8euhl4.\"t1\" ORDER BY id"
2026/04/14 20:36:38 INFO [pg_query_executor] declared cursor x-peerdb-additional-metadata={Operation:FLOW_OPERATION_UNKNOWN} partitionId=testpart cursorQuery="DECLARE peerdb_cursor_16246957176269917882 CURSOR FOR SELECT id,val FROM e2e_test_api_qs8euhl4.\"t1\" ORDER BY id" args=[]
2026/04/14 20:36:38 INFO [pg_query_executor] fetching rows start x-peerdb-additional-metadata={Operation:FLOW_OPERATION_UNKNOWN} partitionId=testpart query="SELECT id,val FROM e2e_test_api_qs8euhl4.\"t1\" ORDER BY id" channelLen=0
2026/04/14 20:36:38 INFO [pg_query_executor] fetching from cursor x-peerdb-additional-metadata={Operation:FLOW_OPERATION_UNKNOWN} partitionId=testpart cursor=peerdb_cursor_16246957176269917882
2026/04/14 20:36:38 INFO processed row stream x-peerdb-additional-metadata={Operation:FLOW_OPERATION_UNKNOWN} partitionId=testpart cursor=peerdb_cursor_16246957176269917882 records=2 bytes=19 channelLen=1
2026/04/14 20:36:38 INFO [pg_query_executor] fetched rows x-peerdb-additional-metadata={Operation:FLOW_OPERATION_UNKNOWN} partitionId=testpart query="SELECT id,val FROM e2e_test_api_qs8euhl4.\"t1\" ORDER BY id" rows=2 bytes=19 channelLen=1
2026/04/14 20:36:38 INFO [pg_query_executor] fetching from cursor x-peerdb-additional-metadata={Operation:FLOW_OPERATION_UNKNOWN} partitionId=testpart cursor=peerdb_cursor_16246957176269917882
2026/04/14 20:36:38 INFO processed row stream x-peerdb-additional-metadata={Operation:FLOW_OPERATION_UNKNOWN} partitionId=testpart cursor=peerdb_cursor_16246957176269917882 records=0 bytes=0 channelLen=0
2026/04/14 20:36:38 INFO [pg_query_executor] fetched rows x-peerdb-additional-metadata={Operation:FLOW_OPERATION_UNKNOWN} partitionId=testpart query="SELECT id,val FROM e2e_test_api_qs8euhl4.\"t1\" ORDER BY id" rows=0 bytes=0 channelLen=0
2026/04/14 20:36:38 INFO [pg_query_executor] committing transaction x-peerdb-additional-metadata={Operation:FLOW_OPERATION_UNKNOWN} partitionId=testpart
2026/04/14 20:36:38 INFO [pg_query_executor] committed transaction for query x-peerdb-additional-metadata={Operation:FLOW_OPERATION_UNKNOWN} partitionId=testpart query="SELECT id,val FROM e2e_test_api_qs8euhl4.\"t1\" ORDER BY id" rows=2 bytes=19 channelLen=0
github.com/PeerDB-io/peerdb/flow/e2e::TestApiMy/TestCancelTableAdditionRemoveAddRemove
Stack Traces | 22.1s run time
=== RUN   TestApiMy/TestCancelTableAdditionRemoveAddRemove
=== PAUSE TestApiMy/TestCancelTableAdditionRemoveAddRemove
=== CONT  TestApiMy/TestCancelTableAdditionRemoveAddRemove
2026/04/14 20:30:02 INFO Received AWS credentials from peer for connector: ci x-peerdb-additional-metadata={Operation:FLOW_OPERATION_UNKNOWN}
2026/04/14 20:30:02 INFO Received AWS credentials from peer for connector: clickhouse x-peerdb-additional-metadata={Operation:FLOW_OPERATION_UNKNOWN}
2026/04/14 20:30:02 INFO fetched schema x-peerdb-additional-metadata={Operation:FLOW_OPERATION_UNKNOWN} table=e2e_test_mych_gxfbxtd4.test_exclude_ch
    cancel_table_addition_test.go:637: WaitFor wait for initial load to finish 2026-04-14 20:30:08.448115732 +0000 UTC m=+259.548528067
    cancel_table_addition_test.go:641: WaitFor t1 2026-04-14 20:30:08.448464086 +0000 UTC m=+259.548876431
2026/04/14 20:30:08 INFO fetched schema x-peerdb-additional-metadata={Operation:FLOW_OPERATION_UNKNOWN} table=e2e_test_api_hlqe9ibs.t1
    cancel_table_addition_test.go:642: WaitFor t2 2026-04-14 20:30:08.460484694 +0000 UTC m=+259.560897039
    cancel_table_addition_test.go:82: WaitFor wait for pause for remove e2e_test_api_hlqe9ibs.t2 2026-04-14 20:30:08.474570456 +0000 UTC m=+259.574982801
2026/04/14 20:30:08 INFO fetched schema x-peerdb-additional-metadata={Operation:FLOW_OPERATION_UNKNOWN} table=e2e_test_mychclg_0tir1unq.test_simple_schema_changes
    cancel_table_addition_test.go:83: UNEXPECTED ERROR unable to establish connection with catalog: FATAL: terminating connection due to administrator command (SQLSTATE 57P01)
    api_test.go:48: begin tearing down postgres schema api_hlqe9ibs
--- FAIL: TestApiMy/TestCancelTableAdditionRemoveAddRemove (22.13s)
github.com/PeerDB-io/peerdb/flow/e2e::TestApiMongo/TestCancelTableAdditionRemoveAddRemove
Stack Traces | 27.1s run time
=== RUN   TestApiMongo/TestCancelTableAdditionRemoveAddRemove
=== PAUSE TestApiMongo/TestCancelTableAdditionRemoveAddRemove
=== CONT  TestApiMongo/TestCancelTableAdditionRemoveAddRemove
2026/04/14 20:35:21 INFO Received AWS credentials from peer for connector: ci x-peerdb-additional-metadata={Operation:FLOW_OPERATION_UNKNOWN}
2026/04/14 20:35:21 INFO Received AWS credentials from peer for connector: clickhouse x-peerdb-additional-metadata={Operation:FLOW_OPERATION_UNKNOWN}
2026/04/14 20:35:21 INFO fetched schema x-peerdb-additional-metadata={Operation:FLOW_OPERATION_UNKNOWN} table=e2e_test_api_c3ysbutl.t1
    cancel_table_addition_test.go:637: WaitFor wait for initial load to finish 2026-04-14 20:35:27.473358265 +0000 UTC m=+596.561897192
    cancel_table_addition_test.go:641: WaitFor t1 2026-04-14 20:35:27.473761152 +0000 UTC m=+596.562300084
    cancel_table_addition_test.go:642: WaitFor t2 2026-04-14 20:35:27.478017915 +0000 UTC m=+596.566556855
    cancel_table_addition_test.go:82: WaitFor wait for pause for remove e2e_test_api_yvpyaayk.t2 2026-04-14 20:35:27.484166648 +0000 UTC m=+596.572705574
2026/04/14 20:35:27 INFO fetched schema x-peerdb-additional-metadata={Operation:FLOW_OPERATION_UNKNOWN} table=e2e_test_api_c3ysbutl.t1
    cancel_table_addition_test.go:100: WaitFor wait for table removal of source_table_identifier:"e2e_test_api_yvpyaayk.t2" destination_table_identifier:"t2" to finish 2026-04-14 20:35:43.509681661 +0000 UTC m=+612.598220600
2026/04/14 20:35:43 INFO [pg_query_executor] declared cursor x-peerdb-additional-metadata={Operation:FLOW_OPERATION_UNKNOWN} partitionId=testpart cursorQuery="DECLARE peerdb_cursor_1611523372885743617 CURSOR FOR SELECT id,val FROM e2e_test_api_5ucrwzqt.\"table1\" ORDER BY id" args=[]
2026/04/14 20:35:43 INFO [pg_query_executor] fetching rows start x-peerdb-additional-metadata={Operation:FLOW_OPERATION_UNKNOWN} partitionId=testpart query="SELECT id,val FROM e2e_test_api_5ucrwzqt.\"table1\" ORDER BY id" channelLen=0
2026/04/14 20:35:43 INFO [pg_query_executor] fetching from cursor x-peerdb-additional-metadata={Operation:FLOW_OPERATION_UNKNOWN} partitionId=testpart cursor=peerdb_cursor_1611523372885743617
2026/04/14 20:35:43 INFO processed row stream x-peerdb-additional-metadata={Operation:FLOW_OPERATION_UNKNOWN} partitionId=testpart cursor=peerdb_cursor_1611523372885743617 records=2 bytes=19 channelLen=1
2026/04/14 20:35:43 INFO [pg_query_executor] fetched rows x-peerdb-additional-metadata={Operation:FLOW_OPERATION_UNKNOWN} partitionId=testpart query="SELECT id,val FROM e2e_test_api_5ucrwzqt.\"table1\" ORDER BY id" rows=2 bytes=19 channelLen=1
2026/04/14 20:35:43 INFO [pg_query_executor] fetching from cursor x-peerdb-additional-metadata={Operation:FLOW_OPERATION_UNKNOWN} partitionId=testpart cursor=peerdb_cursor_1611523372885743617
2026/04/14 20:35:43 INFO processed row stream x-peerdb-additional-metadata={Operation:FLOW_OPERATION_UNKNOWN} partitionId=testpart cursor=peerdb_cursor_1611523372885743617 records=0 bytes=0 channelLen=0
2026/04/14 20:35:43 INFO [pg_query_executor] fetched rows x-peerdb-additional-metadata={Operation:FLOW_OPERATION_UNKNOWN} partitionId=testpart query="SELECT id,val FROM e2e_test_api_5ucrwzqt.\"table1\" ORDER BY id" rows=0 bytes=0 channelLen=0
2026/04/14 20:35:43 INFO [pg_query_executor] committing transaction x-peerdb-additional-metadata={Operation:FLOW_OPERATION_UNKNOWN} partitionId=testpart
2026/04/14 20:35:43 INFO [pg_query_executor] committed transaction for query x-peerdb-additional-metadata={Operation:FLOW_OPERATION_UNKNOWN} partitionId=testpart query="SELECT id,val FROM e2e_test_api_5ucrwzqt.\"table1\" ORDER BY id" rows=2 bytes=19 channelLen=0
    cancel_table_addition_test.go:127: WaitFor wait for pause for add e2e_test_api_yvpyaayk.t2 2026-04-14 20:35:44.51465261 +0000 UTC m=+613.603191542
2026/04/14 20:35:44 INFO Executing and processing query x-peerdb-additional-metadata={Operation:FLOW_OPERATION_UNKNOWN} partitionId=testpart query="SELECT id,val FROM e2e_test_api_5ucrwzqt.\"table1\" ORDER BY id"
2026/04/14 20:35:44 INFO Executing and processing query stream x-peerdb-additional-metadata={Operation:FLOW_OPERATION_UNKNOWN} partitionId=testpart query="SELECT id,val FROM e2e_test_api_5ucrwzqt.\"table1\" ORDER BY id"
2026/04/14 20:35:44 INFO [pg_query_executor] declared cursor x-peerdb-additional-metadata={Operation:FLOW_OPERATION_UNKNOWN} partitionId=testpart cursorQuery="DECLARE peerdb_cursor_12950402609540204079 CURSOR FOR SELECT id,val FROM e2e_test_api_5ucrwzqt.\"table1\" ORDER BY id" args=[]
2026/04/14 20:35:44 INFO [pg_query_executor] fetching rows start x-peerdb-additional-metadata={Operation:FLOW_OPERATION_UNKNOWN} partitionId=testpart query="SELECT id,val FROM e2e_test_api_5ucrwzqt.\"table1\" ORDER BY id" channelLen=0
2026/04/14 20:35:44 INFO [pg_query_executor] fetching from cursor x-peerdb-additional-metadata={Operation:FLOW_OPERATION_UNKNOWN} partitionId=testpart cursor=peerdb_cursor_12950402609540204079
2026/04/14 20:35:44 INFO processed row stream x-peerdb-additional-metadata={Operation:FLOW_OPERATION_UNKNOWN} partitionId=testpart cursor=peerdb_cursor_12950402609540204079 records=2 bytes=19 channelLen=1
2026/04/14 20:35:44 INFO [pg_query_executor] fetched rows x-peerdb-additional-metadata={Operation:FLOW_OPERATION_UNKNOWN} partitionId=testpart query="SELECT id,val FROM e2e_test_api_5ucrwzqt.\"table1\" ORDER BY id" rows=2 bytes=19 channelLen=1
2026/04/14 20:35:44 INFO [pg_query_executor] fetching from cursor x-peerdb-additional-metadata={Operation:FLOW_OPERATION_UNKNOWN} partitionId=testpart cursor=peerdb_cursor_12950402609540204079
2026/04/14 20:35:44 INFO processed row stream x-peerdb-additional-metadata={Operation:FLOW_OPERATION_UNKNOWN} partitionId=testpart cursor=peerdb_cursor_12950402609540204079 records=0 bytes=0 channelLen=0
2026/04/14 20:35:44 INFO [pg_query_executor] fetched rows x-peerdb-additional-metadata={Operation:FLOW_OPERATION_UNKNOWN} partitionId=testpart query="SELECT id,val FROM e2e_test_api_5ucrwzqt.\"table1\" ORDER BY id" rows=0 bytes=0 channelLen=0
2026/04/14 20:35:44 INFO [pg_query_executor] committing transaction x-peerdb-additional-metadata={Operation:FLOW_OPERATION_UNKNOWN} partitionId=testpart
2026/04/14 20:35:44 INFO [pg_query_executor] committed transaction for query x-peerdb-additional-metadata={Operation:FLOW_OPERATION_UNKNOWN} partitionId=testpart query="SELECT id,val FROM e2e_test_api_5ucrwzqt.\"table1\" ORDER BY id" rows=2 bytes=19 channelLen=0
    cancel_table_addition_test.go:128: UNEXPECTED ERROR unable to establish connection with catalog: FATAL: terminating connection due to administrator command (SQLSTATE 57P01)
    api_test.go:48: begin tearing down postgres schema api_yvpyaayk
--- FAIL: TestApiMongo/TestCancelTableAdditionRemoveAddRemove (27.12s)
github.com/PeerDB-io/peerdb/flow/e2e::TestGenericBQ/Test_Simple_Flow
Stack Traces | 33.6s run time
=== RUN   TestGenericBQ/Test_Simple_Flow
=== PAUSE TestGenericBQ/Test_Simple_Flow
=== CONT  TestGenericBQ/Test_Simple_Flow
    generic_test.go:124: UNEXPECTED STATUS TIMEOUT STATUS_SNAPSHOT
    bigquery.go:86: begin tearing down postgres schema bq_ipllifnk_20260414204626
--- FAIL: TestGenericBQ/Test_Simple_Flow (33.63s)


@github-actions

🔄 Flaky Test Detected

Analysis: Both failures are caused by SQLSTATE 57P01 (PostgreSQL admin shutdown), a transient infrastructure event where the database connection was terminated externally — not a code logic bug.
Confidence: 0.95

✅ Automatically retrying the workflow

View workflow run

Comment thread flow/otel_metrics/otel_manager.go Outdated
@github-actions

🔄 Flaky Test Detected

Analysis: Both failures stem from PostgreSQL connection termination (SQLSTATE 57P01 — "terminating connection due to administrator command") and an active replication slot blocking teardown (SQLSTATE 55006), which are transient infrastructure/resource-contention issues in the concurrent e2e test environment, not code logic bugs.
Confidence: 0.92

✅ Automatically retrying the workflow

View workflow run

@github-actions

🔄 Flaky Test Detected

Analysis: Both failures are transient: one is a snapshot status timeout (async timing issue) and the other is a PostgreSQL connection terminated by admin shutdown (SQLSTATE 57P01), neither indicating a code regression.
Confidence: 0.93

✅ Automatically retrying the workflow

View workflow run

Comment thread flow/pkg/clickhouse/query_retry.go Outdated

// conditionallyRetryableExceptions are error codes that are only retryable
// when the error message contains one of the specified substrings
var conditionallyRetryableExceptions = map[chproto.Error][]string{

Just an idea, how about retryableExceptionSubstrings? Wouldn't need an explanation this way

@jgao54 jgao54 enabled auto-merge (squash) April 14, 2026 20:18
@github-actions

🔄 Flaky Test Detected

Analysis: The test failed due to a transient PostgreSQL connection drop (SQLSTATE 57P01 — admin_shutdown), where the catalog DB connection was terminated by an administrator command mid-test, indicating a CI infrastructure issue rather than a code bug.
Confidence: 0.95

✅ Automatically retrying the workflow

View workflow run

@github-actions

🔄 Flaky Test Detected

Analysis: TestGenericBQ/Test_Simple_Flow timed out waiting for STATUS_SNAPSHOT to complete — a classic transient timeout in an e2e test that depends on BigQuery and external services, with only 2 failures out of 2348 tests and the codebase itself flagging BigQuery tests as flaky under high concurrency.
Confidence: 0.92

✅ Automatically retrying the workflow

View workflow run

@jgao54 jgao54 merged commit dbfbd2d into main Apr 14, 2026
17 of 20 checks passed
@jgao54 jgao54 deleted the normalize-err branch April 14, 2026 21:16
3 participants